Distinct

distinct() eliminates duplicate records (rows that match on all columns) from a DataFrame.

// in the spark-shell; in a standalone app, import spark.implicits._ first so toDF is available
val data = Seq(
  ("James", "Sales", 3000),
  ("Michael", "Sales", 4600),
  ("Robert", "Sales", 100),
  ("Maria", "Finance", 3000),
  ("James", "Sales", 3000),
  ("Scott", "Finance", 3300),
  ("Jen", "Finance", 3900),
  ("Jeff", "Marketing", 3000),
  ("Kumar", "Marketing", 2000),
  ("Saif", "Sales", 4100))
val df = data.toDF("employee_name", "department", "salary")
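
The Seq above contains one exact duplicate row ("James", "Sales", 3000), so calling distinct() on the whole DataFrame removes it. A quick check of the row counts:

df.count()            // 10 rows, including the duplicate
df.distinct().count() // 9 rows once the duplicate is removed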

Select department name from the DataFrame
df.select("department").show()

Select unique records from the DataFrame
We can use the distinct() function to remove the duplicate rows of a DataFrame and get back a DataFrame without them.

df.select($"department").distinct().show()


You can also use dropDuplicates() to get unique values.
With no arguments, the dropDuplicates() operation drops duplicate rows just like distinct().

df.select($"department").dropDuplicates().show()
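
Unlike distinct(), dropDuplicates can also deduplicate on a subset of columns while keeping the rest of the row. A minimal sketch; the column choice here is just for illustration:

// keep one row per (department, salary) pair; the other columns come from an arbitrary surviving row
df.dropDuplicates("department", "salary").show()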


Count the unique records in a DataFrame
#pyspark
Data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Michael", "Sales", 4600),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Kumar", "Marketing", 2000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]
columns = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=Data, schema=columns)

df.count()
Out[26]: 15

df.distinct().count()
Out[27]: 9
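
To count the distinct values of a single column rather than whole rows, Spark also provides countDistinct. A minimal Scala sketch, reusing the df from the Scala example at the top of the post:

import org.apache.spark.sql.functions.countDistinct

// three distinct departments in the sample data: Sales, Finance, Marketing
df.select(countDistinct("department")).show()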
